{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "punL79CN7Ox6"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "_ckMIh7O7s6D"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hAclqSm3OOml"
      },
      "source": [
        "# Using LSTMs, CNNs, GRUs with a larger dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S5Uhzt6vVIB2"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l10c02_nlp_multiple_models_for_predicting_sentiment.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l10c02_nlp_multiple_models_for_predicting_sentiment.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fentd-GnIj-j"
      },
      "source": [
        "In this colab, you use different kinds of layers to see how they affect the model.\n",
        "\n",
        "You will use the glue/sst2 dataset, which is available through tensorflow_datasets. \n",
        "\n",
        "The General Language Understanding Evaluation (GLUE) benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems.\n",
        "\n",
        "These resources include the Stanford Sentiment Treebank (SST) dataset that consists of sentences from movie reviews and human annotations of their sentiment. This colab uses version 2 of the SST dataset.\n",
        "\n",
        "The splits are:\n",
        "\n",
        "*   train\t67,349\n",
        "*   validation\t872\n",
        "\n",
        "\n",
        "and the column headings are:\n",
        "\n",
        "*   sentence\n",
        "*   label\n",
        "\n",
        "\n",
        "For more information about the dataset, see [https://www.tensorflow.org/datasets/catalog/glue#gluesst2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "L62G7LTwNzoD"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_datasets as tfds\n",
        "\n",
        "import numpy as np"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "o2P6xdtIJMyc"
      },
      "source": [
        "# Get the dataset\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nCOtiRJZbxCH"
      },
      "outputs": [],
      "source": [
        "# Get the dataset.\n",
        "# It has 70000 items, so might take a while to download\n",
        "dataset, info = tfds.load('glue/sst2', with_info=True)\n",
        "print(info.features)\n",
        "print(info.features[\"label\"].num_classes)\n",
        "print(info.features[\"label\"].names)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yBMPhIdStAe2"
      },
      "outputs": [],
      "source": [
        "# Get the training and validation datasets\n",
        "dataset_train, dataset_validation = dataset['train'], dataset['validation']\n",
        "dataset_train"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jbZ-faiNWu1U"
      },
      "outputs": [],
      "source": [
        "# Print some of the entries\n",
        "for example in dataset_train.take(2):  \n",
        "  review, label = example[\"sentence\"], example[\"label\"]\n",
        "  print(\"Review:\", review)\n",
        "  print(\"Label: %d \\n\" % label.numpy())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_fVZItTeZSbL"
      },
      "outputs": [],
      "source": [
        "# Get the sentences and the labels\n",
        "# for both the training and the validation sets\n",
        "training_reviews = []\n",
        "training_labels = []\n",
        " \n",
        "validation_reviews = []\n",
        "validation_labels = []\n",
        "\n",
        "# The dataset has 67,000 training entries, but that's a lot to process here!\n",
        "\n",
        "# If you want to take the entire dataset: WARNING: takes longer!!\n",
        "# for item in dataset_train.take(-1):\n",
        "\n",
        "# Take 10,000 reviews\n",
        "for item in dataset_train.take(10000):\n",
        "  review, label = item[\"sentence\"], item[\"label\"]\n",
        "  training_reviews.append(str(review.numpy()))\n",
        "  training_labels.append(label.numpy())\n",
        "\n",
        "print (\"\\nNumber of training reviews is: \", len(training_reviews))\n",
        "\n",
        "# print some of the reviews and labels\n",
        "for i in range(0, 2):\n",
        "  print (training_reviews[i])\n",
        "  print (training_labels[i])\n",
        "\n",
        "# Get the validation data\n",
        "# there's only about 800 items, so take them all\n",
        "for item in dataset_validation.take(-1):  \n",
        "  review, label = item[\"sentence\"], item[\"label\"]\n",
        "  validation_reviews.append(str(review.numpy()))\n",
        "  validation_labels.append(label.numpy())\n",
        "\n",
        "print (\"\\nNumber of validation reviews is: \", len(validation_reviews))\n",
        "\n",
        "# Print some of the validation reviews and labels\n",
        "for i in range(0, 2):\n",
        "  print (validation_reviews[i])\n",
        "  print (validation_labels[i])\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BY4ZoptJO55o"
      },
      "source": [
        "# Tokenize the words and sequence the sentences\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0TWLvXA1Oa_W"
      },
      "outputs": [],
      "source": [
        "# There's a total of 21224 words in the reviews\n",
        "# but many of them are irrelevant like with, it, of, on.\n",
        "# If we take a subset of the training data, then the vocab\n",
        "# will be smaller.\n",
        "\n",
        "# A reasonable review might have about 50 words or so,\n",
        "# so we can set max_length to 50 (but feel free to change it as you like)\n",
        "\n",
        "vocab_size = 4000\n",
        "embedding_dim = 16\n",
        "max_length = 50\n",
        "trunc_type='post'\n",
        "pad_type='post'\n",
        "oov_tok = \"<OOV>\"\n",
        "\n",
        "from tensorflow.keras.preprocessing.text import Tokenizer\n",
        "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
        "\n",
        "tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\n",
        "tokenizer.fit_on_texts(training_reviews)\n",
        "word_index = tokenizer.word_index\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JV-Ff5N0ryWv"
      },
      "source": [
        "# Pad the sequences"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "B-3scEznH2Va"
      },
      "outputs": [],
      "source": [
        "# Pad the sequences so that they are all the same length\n",
        "training_sequences = tokenizer.texts_to_sequences(training_reviews)\n",
        "training_padded = pad_sequences(training_sequences,maxlen=max_length, \n",
        "                                truncating=trunc_type, padding=pad_type)\n",
        "\n",
        "validation_sequences = tokenizer.texts_to_sequences(validation_reviews)\n",
        "validation_padded = pad_sequences(validation_sequences,maxlen=max_length)\n",
        "\n",
        "training_labels_final = np.array(training_labels)\n",
        "validation_labels_final = np.array(validation_labels)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PahZm7YEQ8EI"
      },
      "source": [
        "# Create the model using an Embedding"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c_nyQeI0RCCv"
      },
      "outputs": [],
      "source": [
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.GlobalAveragePooling1D(),  \n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
        "model.summary()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3WRXrx8BRO2L"
      },
      "source": [
        "# Train the model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oBKyVYvxRQ_9"
      },
      "outputs": [],
      "source": [
        "num_epochs = 20\n",
        "history = model.fit(training_padded, training_labels_final, epochs=num_epochs, \n",
        "                    validation_data=(validation_padded, validation_labels_final))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HhLPbUl2AZ0y"
      },
      "source": [
        "# Plot the accurracy and loss"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jzBM1PpJAYfD"
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "\n",
        "def plot_graphs(history, string):\n",
        "  plt.plot(history.history[string])\n",
        "  plt.plot(history.history['val_'+string])\n",
        "  plt.xlabel(\"Epochs\")\n",
        "  plt.ylabel(string)\n",
        "  plt.legend([string, 'val_'+string])\n",
        "  plt.show()\n",
        "  \n",
        "plot_graphs(history, \"accuracy\")\n",
        "plot_graphs(history, \"loss\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HEbcMCVEKToB"
      },
      "source": [
        "# Write a function to predict the sentiment of reviews"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K0nKY9M4xzWE"
      },
      "outputs": [],
      "source": [
        "# Write some new reviews \n",
        "\n",
        "review1 = \"\"\"I loved this movie\"\"\"\n",
        "\n",
        "review2 = \"\"\"that was the worst movie I've ever seen\"\"\"\n",
        "\n",
        "review3 = \"\"\"too much violence even for a Bond film\"\"\"\n",
        "\n",
        "review4 = \"\"\"a captivating recounting of a cherished myth\"\"\"\n",
        "\n",
        "new_reviews = [review1, review2, review3, review4]\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Qg-maex27KPW"
      },
      "outputs": [],
      "source": [
        "# Define a function to prepare the new reviews for use with a model\n",
        "# and then use the model to predict the sentiment of the new reviews           \n",
        "\n",
        "def predict_review(model, reviews):\n",
        "  # Create the sequences\n",
        "  padding_type='post'\n",
        "  sample_sequences = tokenizer.texts_to_sequences(reviews)\n",
        "  reviews_padded = pad_sequences(sample_sequences, padding=padding_type, \n",
        "                                 maxlen=max_length) \n",
        "  classes = model.predict(reviews_padded)\n",
        "  for x in range(len(reviews_padded)):\n",
        "    print(reviews[x])\n",
        "    print(classes[x])\n",
        "    print('\\n')\n",
        "\n",
        "predict_review(model, new_reviews)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ycJKbMq3K4iy"
      },
      "source": [
        "# Define a function to train and show the results of models with different layers"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PevUcINXK3gn"
      },
      "outputs": [],
      "source": [
        "def fit_model_and_show_results (model, reviews):\n",
        "  model.summary()\n",
        "  history = model.fit(training_padded, training_labels_final, epochs=num_epochs, \n",
        "                      validation_data=(validation_padded, validation_labels_final))\n",
        "  plot_graphs(history, \"accuracy\")\n",
        "  plot_graphs(history, \"loss\")\n",
        "  predict_review(model, reviews)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W8jW-OLfTrDM"
      },
      "source": [
        "# Use a CNN"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "merAu9T3TtmQ"
      },
      "outputs": [],
      "source": [
        "num_epochs = 30\n",
        "\n",
        "model_cnn = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Conv1D(16, 5, activation='relu'),\n",
        "    tf.keras.layers.GlobalMaxPooling1D(),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "# Default learning rate for the Adam optimizer is 0.001\n",
        "# Let's slow down the learning rate by 10.\n",
        "learning_rate = 0.0001\n",
        "model_cnn.compile(loss='binary_crossentropy',\n",
        "                  optimizer=tf.keras.optimizers.Adam(learning_rate), \n",
        "                  metrics=['accuracy'])\n",
        "\n",
        "fit_model_and_show_results(model_cnn, new_reviews)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tXnoq0zITmSM"
      },
      "source": [
        "# Use a GRU"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6jP6KAzZTpQ6"
      },
      "outputs": [],
      "source": [
        "num_epochs = 30\n",
        "\n",
        "model_gru = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "learning_rate = 0.00003 # slower than the default learning rate\n",
        "model_gru.compile(loss='binary_crossentropy',\n",
        "                  optimizer=tf.keras.optimizers.Adam(learning_rate),\n",
        "                  metrics=['accuracy'])\n",
        "\n",
        "fit_model_and_show_results(model_gru, new_reviews)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U13JBiJUG1oq"
      },
      "source": [
        "# Add a bidirectional LSTM"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "scTUsFPAG4zP"
      },
      "outputs": [],
      "source": [
        "num_epochs = 30\n",
        "\n",
        "model_bidi_lstm = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)), \n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "learning_rate = 0.00003\n",
        "model_bidi_lstm.compile(loss='binary_crossentropy',\n",
        "                        optimizer=tf.keras.optimizers.Adam(learning_rate),\n",
        "                        metrics=['accuracy'])\n",
        "fit_model_and_show_results(model_bidi_lstm, new_reviews)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QsxKPbCnPJTj"
      },
      "source": [
        "# Use multiple bidirectional LSTMs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3N6Zul47PMED"
      },
      "outputs": [],
      "source": [
        "num_epochs = 30\n",
        "\n",
        "model_multiple_bidi_lstm = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, \n",
        "                                                       return_sequences=True)),\n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "learning_rate = 0.0003\n",
        "model_multiple_bidi_lstm.compile(loss='binary_crossentropy',\n",
        "                                 optimizer=tf.keras.optimizers.Adam(learning_rate),\n",
        "                                 metrics=['accuracy'])\n",
        "fit_model_and_show_results(model_multiple_bidi_lstm, new_reviews)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gdN2tdW4YYJ1"
      },
      "source": [
        "# Try some more reviews"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "e45UQxQl_QAI"
      },
      "outputs": [],
      "source": [
        "# Write some new reviews \n",
        "\n",
        "review1 = \"\"\"I loved this movie\"\"\"\n",
        "\n",
        "review2 = \"\"\"that was the worst movie I've ever seen\"\"\"\n",
        "\n",
        "review3 = \"\"\"too much violence even for a Bond film\"\"\"\n",
        "\n",
        "review4 = \"\"\"a captivating recounting of a cherished myth\"\"\"\n",
        "\n",
        "review5 = \"\"\"I saw this movie yesterday and I was feeling low to start with,\n",
        " but it was such a wonderful movie that it lifted my spirits and brightened \n",
        " my day, you can\\'t go wrong with a movie with Whoopi Goldberg in it.\"\"\"\n",
        "\n",
        "review6 = \"\"\"I don\\'t understand why it received an oscar recommendation\n",
        " for best movie, it was long and boring\"\"\"\n",
        "\n",
        "review7 = \"\"\"the scenery was magnificent, the CGI of the dogs was so realistic I\n",
        " thought they were played by real dogs even though they talked!\"\"\"\n",
        "\n",
        "review8 = \"\"\"The ending was so sad and yet so uplifting at the same time. \n",
        " I'm looking for an excuse to see it again\"\"\"\n",
        "\n",
        "review9 = \"\"\"I had expected so much more from a movie made by the director \n",
        " who made my most favorite movie ever, I was very disappointed in the tedious \n",
        " story\"\"\"\n",
        "\n",
        "review10 = \"I wish I could watch this movie every day for the rest of my life\"\n",
        "\n",
        "more_reviews = [review1, review2, review3, review4, review5, review6, review7, \n",
        "               review8, review9, review10]\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cQ4YZHOjYXQ2"
      },
      "outputs": [],
      "source": [
        "print(\"============================\\n\",\"Embeddings only:\\n\", \"============================\")\n",
        "predict_review(model, more_reviews)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NI04noTAHK5t"
      },
      "outputs": [],
      "source": [
        "print(\"============================\\n\",\"With CNN\\n\", \"============================\")\n",
        "predict_review(model_cnn, more_reviews)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vGJ32sRUHNu6"
      },
      "outputs": [],
      "source": [
        "print(\"===========================\\n\",\"With bidirectional GRU\\n\", \"============================\")\n",
        "predict_review(model_gru, more_reviews)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IFw9Q0iQHP2P"
      },
      "outputs": [],
      "source": [
        "print(\"===========================\\n\", \"With a single bidirectional LSTM:\\n\", \"===========================\")\n",
        "predict_review(model_bidi_lstm, more_reviews)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zaw7fEVQHSWK"
      },
      "outputs": [],
      "source": [
        "print(\"===========================\\n\", \"With multiple bidirectional LSTM:\\n\", \"==========================\")\n",
        "predict_review(model_multiple_bidi_lstm, more_reviews)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "l10c02_nlp_multiple_models_for_predicting_sentiment.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
